Disentangling
  • Home
  • About this project

On this page

  • Overview
    • Models
    • Filtering of unitigs and defining significance threshold
  • Results
    • Significant hits ordered by average effect size
    • Interactive plot of unitigs mapped to genes
    • Summary of annotated regions matching the unitigs (Table)
    • Unitig sequences, significance, effect size and heritability (Table)
    • Unitigs mapped to reference genomes
  • References

Conditional GWAS

Published

January 3, 2023

Overview

Models

Here we present the results of a genome-wide association study (GWAS) conducted using pyseer (John A. Lees et al. 2018), a tool for identifying genetic associations with complex traits. The GWAS was performed using the linear mixed effects model such that the minimum inhibitory concentration (MIC) of Penicillin G: \[\mathbf{y} \sim \mathbf{W} \mathbf{a} + \mathbf{X} \mathbf{b} + \mathbf{K} \mathbf{u} +\epsilon\]

with

\[ \begin{align*} \mathbf{u} \sim \mathcal{N} \left( 0, \sigma^2_g \mathbf{G} \right), \quad &\mathbf{\epsilon} \sim \mathcal{N} \left( 0, \sigma^2_g \mathbf{I} \right) \end{align*} \]

Where \(\mathbf{G}\) is the similarity matrix that was obtained from the phylogenetic relationship of the samples, and \(\mathbf{I}\) is the identity matrix.

Table 1: Information about the GWAS model
Symbol Meaning
𝑦 A vector containing the MIC data for each sample.
𝑊 A design matrix containing the conditional covariates: We used categorical variables for the variants of pencillin binding proteins pbp1a, pbp2b and pbp2x, coded according to the allele scheme proposed in (Y. Li et al. 2016), and benchmarked in (Y. Li et al. 2017). The allele codes were obtained by running the Pathogenwatch pipeline. We also included the study the data was obtained as a categorical variable, to capture differences in MIC measurement protocols.
𝑎 Fixed effects for the covariates.
𝑋 A design matrix containing the presence or absence of the unitigs.
𝑏 Fixed effects of the unitigs.
𝐾 The similarity matrix between all pairs of samples. We used a neighbor-joining phylogeny estimated using PopPUNK (John A. Lees et al. 2019).
𝑢 Random effects for each row of the kinship matrix.
ε Normally distributed error terms.

Thus we assume that the MIC can be predicted by the covariates with the correlation structure implied by the phylogenetic relationship of the samples. If we take the expectation of the equation we obtain \[E(\mathbf{y}) \sim \mathbf{W} \mathbf{a} + \mathbf{X} \mathbf{b} + \mathbf{K} \mathbf{u} +\epsilon\] \[ = E(\mathbf{W} \mathbf{a} + \mathbf{X} \mathbf{b} + \mathbf{K} \mathbf{u} +\epsilon)\] \[ = \mathbf{W} \mathbf{a} + \mathbf{X} \mathbf{b}\] and variance \[V(\mathbf{y}) = V(\mathbf{W} \mathbf{a} + \mathbf{X} \mathbf{b} + \mathbf{K} \mathbf{u} +\epsilon)\] \[ \begin{aligned} V(\mathbf{y}) = V(\mathbf{W} \mathbf{a} + \mathbf{X} \mathbf{b} + \mathbf{K} \mathbf{u} +\epsilon) \\ = V(\mathbf{K} \mathbf{u} + \epsilon) \\ = V(\mathbf{K} \mathbf{u}) + V(\epsilon) \\ = \sigma^2_g \mathbf{G} + \sigma^2_g \mathbf{I} \end{aligned} \]

Filtering of unitigs and defining significance threshold

We filtered the unitigs based on a minimum allele frequency (MAF) between [0.01, 0.99] meaning that the unitigs have to be present in more than 1% and less than 99% of the isolates. We used the number of patterns in the unitigs to define a significance threshold of the p-values of the unitigs. We detected 974212 unique patterns translating to a significance threshold of \(p = 5.13\cdot 10^{-08}\). All downstream results and visualizations were reported for unitigs with a smaller p-value than the threshold. Many of these hits were then annotated using the nomenclature of Bakta (Schwengers et al. 2021). The annotations can be looked up on NCBI for further information.

Results

Significant hits ordered by average effect size

Figure 1: Plot of unitigs mapped to annotated regions of references and assemblies. This scatter plot shows the relationship between the average effect size and the maximum -log10(p-value) of genes associated with penicillin resistance. The size of the points represents the number of unitigs, and the color represents the average minor allele frequency (MAF). Points are labeled with gene name. The plot highlights genes that have high effect size and large -log10(p-value) as well as high unitig count and high MAF value.

Significant hits ordered by minor allele frequency (MAF)

Figure 2: Plot of unitigs mapped to annotated regions of references and assemblies. This scatter plot shows the relationship between the average effect size and the maximum -log10(p-value) of genes associated with penicillin resistance. The size of the points represents the number of unitigs, and the color represents the average minor allele frequency (MAF). Points are labeled with gene name. The plot highlights genes that have high effect size and large -log10(p-value) as well as high unitig count and high MAF value.

Interactive plot of unitigs mapped to genes

In the figures below the unitigs are mapped to reference genomes and. As references we used multiple genomes, and additionally we annotated our

  • Significant hits ordered by effect size
  • Significant hits ordered by MAF

Figure 3: Interactive plot of unitigs mapped to annotated regions of references and assemblies. This scatter plot shows the relationship between the average effect size and the maximum -log10(p-value) of genes associated with penicillin resistance. The size of the points represents the number of unitigs, and the color represents the average minor allele frequency (MAF). Points are labeled with gene name. The plot highlights genes that have high effect size and large -log10(p-value) as well as high unitig count and high MAF value.

Figure 4: Interactive plot of unitigs mapped to annotated regions of references and assemblies. This scatter plot shows the relationship between the average minor allele frequency (MAF) and the maximum -log10(p-value) of genes associated with penicillin resistance. The size of the points represents the number of unitigs, and the color represents the average effect size of the genes. Points are labeled with gene name. The plot highlights genes that have high MAF value and large -log10(p-value) as well as high unitig count and large average effect size.

Summary of annotated regions matching the unitigs (Table)

Figure 5: Gene hits Annotated regions of mapped unitigs. The table summarizes the locations of unitigs in the genome, including the gene annotated region (gene), the number of unitigs in that region (HITS), the highest observed p-value (MAXP), the average minimum allele frequency (AVG_MAF), and the average effect size (AVG_BETA) of unitigs within the region. The table also includes a column (NAMES_FOR_DISPLAY) to indicate regions that have common gene names, which match the names in the interactive figures above.

Unitig sequences, significance, effect size and heritability (Table)

Figure 6: Unitig sequences. The table displays various characteristics of the significant unitigs including allele frequency (AF), p-value (filter-pvalue), likelihood (lrt-pvalue), effect size (beta), estimated standard error of the effect size (beta-std-err), estimated heritability of the unitig sequence (variant_h2), and the annotated region in which the unitig is found (annotation) as well as the nearest upstream and downstream annotated regions (gene-upstream gene-downstream gene).

Unitigs mapped to reference genomes

In total we extracted 2527 significant unitigs. We mapped these unitigs to different reference genomes, summarized in the table below. Streptococcus pneumoniae has a large pan-genome, and none of the references contain all of the unitigs. Note that a high fraction of the unitigs were found in ST556, which represents a sequence from a multidrug-resistant isolate from an otitis media patient (G. Li et al. 2012).

Table 2: Information about the reference strains
Reference strain Description
ATCC70066 Multidrug-Resistant Pandemic Clone Streptococcus pneumoniae. 885 out of 2527 unitigs were mapped to this genome. https://doi.org/10.1128/JB.01343-08
Hu17 high-level beta-lactam resistance and penicillin sensitive phenotype. 719 out of 2527 unitigs were mapped to this genome. https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_002076835.1/
EF3030 A serotype 19F isolate that colonizes the nasopharynx of mice while being mostly noninvasive. 1359 out of 2527 unitigs were mapped to this genome. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6509521/
ST556 A Multidrug-Resistant Isolate from an Otitis Media Patient. 2147 out of 2527 unitigs were mapped to this genome. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3370858/
TIGR4 From the paper: The acquisition of clinically relevant amoxicillin resistance in Streptococcus pneumoniae requires ordered horizontal gene transfer of four loci. 774 out of 2527 unitigs were mapped to this genome. https://pubmed.ncbi.nlm.nih.gov/35877768/
SN75752 From the paper: The acquisition of clinically relevant amoxicillin resistance in Streptococcus pneumoniae requires ordered horizontal gene transfer of four loci. 748 out of 2527 unitigs were mapped to this genome. https://pubmed.ncbi.nlm.nih.gov/35877768/
  • ATCC700669
  • Hu17
  • EF3030
  • ST556
  • TIGR4
  • SN75752

Figure 7: Unitigs mapped to various reference genomes. The reference strains used for the mapping are described in the table above. The data displayed when hovering over the points in the figure represents the annotation attributes extracted from the genes that overlap with the starting position of the unitig (basepair position). It’s noteworthy that the “product” and “gene” attributes can provide valuable insights into the function of the genomic regions that the unitigs overlap with.

References

Lees, John A, Marco Galardini, Stephen D Bentley, Jeffrey N Weiser, and Jukka Corander. 2018. “Pyseer: A Comprehensive Tool for Microbial Pangenome-Wide Association Studies.” Edited by Oliver Stegle. Bioinformatics 34 (24): 4310–12. https://doi.org/10.1093/bioinformatics/bty539.
Lees, John A., Simon R. Harris, Gerry Tonkin-Hill, Rebecca A. Gladstone, Stephanie W. Lo, Jeffrey N. Weiser, Jukka Corander, Stephen D. Bentley, and Nicholas J. Croucher. 2019. “Fast and Flexible Bacterial Genomic Epidemiology with PopPUNK.” Genome Research 29 (2): 304–16. https://doi.org/10.1101/gr.241455.118.
Li, Guiling, Fen Z. Hu, Xianwei Yang, Yujun Cui, Jun Yang, Fen Qu, George F. Gao, and Jing-Ren Zhang. 2012. “Complete Genome Sequence of Streptococcus Pneumoniae Strain ST556, a Multidrug-Resistant Isolate from an Otitis Media Patient.” Journal of Bacteriology 194 (12): 3294–95. https://doi.org/10.1128/jb.00363-12.
Li, Yuan, Benjamin J. Metcalf, Sopio Chochua, Zhongya Li, Robert E. Gertz, Hollis Walker, Paulina A. Hawkins, et al. 2016. “Penicillin-Binding Protein Transpeptidase Signatures for Tracking and Predicting β-Lactam Resistance Levels in Streptococcus Pneumoniae.” Edited by James M. Hughes. mBio 7 (3). https://doi.org/10.1128/mbio.00756-16.
Li, Yuan, Benjamin J. Metcalf, Sopio Chochua, Zhongya Li, Robert E. Gertz, Hollis Walker, Paulina A. Hawkins, Theresa Tran, Lesley McGee, and Bernard W. Beall. 2017. “Validation of β-Lactam Minimum Inhibitory Concentration Predictions for Pneumococcal Isolates with Newly Encountered Penicillin Binding Protein (PBP) Sequences.” BMC Genomics 18 (1). https://doi.org/10.1186/s12864-017-4017-7.
Schwengers, Oliver, Lukas Jelonek, Marius Alfred Dieckmann, Sebastian Beyvers, Jochen Blom, and Alexander Goesmann. 2021. “Bakta: Rapid and Standardized Annotation of Bacterial Genomes via Alignment-Free Sequence Identification.” Microbial Genomics 7 (11). https://doi.org/10.1099/mgen.0.000685.